Document-feature matrix of: 2 documents, 5 features (20.00% sparse) and 0 docvars.
features
docs i like programming do not
text1 1 1 1 0 0
text2 1 1 1 1 1
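A minimal sketch of how a matrix like the one above can be built. The two example sentences are assumptions reconstructed from the feature counts shown; `dfm()` lowercases by default:

```r
library(quanteda)

# Two short example documents (reconstructed to match the matrix above)
texts <- c(text1 = "I like programming",
           text2 = "I do not like programming")

# tokens() splits the texts into tokens; dfm() counts them per document
dfm(tokens(texts))
```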
Session 2️⃣: Going beyond bag-of-words: An introduction
Valerie Hase (LMU Munich)
Likely ❌ wrong assumption that:
Disassembling texts into tokens is the foundation of the bag-of-words (BOW) model
BOW as a simplified representation of text in which only token frequencies are considered
Note. Figure from Jurafsky & Martin (2023, p. 60).
Can you come up with examples for when this assumption is violated? 🤔
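One classic illustration of where the assumption breaks down, as a sketch: two sentences with opposite meanings produce identical bag-of-words profiles because word order is discarded.

```r
library(quanteda)

# "dog bites man" vs. "man bites dog": opposite meanings,
# but identical token frequencies
texts <- c(a = "the dog bites the man", b = "the man bites the dog")
m <- dfm(tokens(texts))

# Both rows of the document-feature matrix are the same
all(as.numeric(m[1, ]) == as.numeric(m[2, ]))
#> [1] TRUE
```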
Likely ❌ violated / not helpful when dealing with…
Have you learned about any methods that relax/do not rely on the bag-of-word assumption? 🤔
quanteda.corpora package (install directly from GitHub using devtools)
ngram: a sequence of n successive features in a corpus
Let’s check out examples from our corpus:
Tokens consisting of 1 document and 6 docvars.
Washington-1790 :
[1] "Fellow-Citizens_of" "of_the" "the_Senate"
[4] "Senate_and" "and_House" "House_of"
[7] "of_Representatives" "Representatives_:" ":_I"
[10] "I_embrace" "embrace_with" "with_great"
[ ... and 1,154 more ]
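Output like the above can be produced with `tokens_ngrams()`. A sketch using quanteda's built-in inaugural corpus as a stand-in (the slides use the SOTU corpus from quanteda.corpora, so document names differ):

```r
library(quanteda)

# Stand-in corpus: quanteda's built-in inaugural addresses
toks <- tokens(data_corpus_inaugural["1789-Washington"])

# Bigrams: every sequence of 2 successive tokens, joined with "_"
head(tokens_ngrams(toks, n = 2)[[1]], 6)
```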
Keywords-in-context (KWIC) as a way of displaying concordances, i.e., specific features and their context, as a type of ngram.
Let’s remember how they work:
Keyword-in-context with 3 matches.
[Washington-1790, 49] Constitution of the | United | States ( of
[Washington-1790, 428] interests of the | United | States require that
[Washington-1790, 559] measures of the | United | States is an
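A sketch of the call behind such output, again using the built-in inaugural corpus as a stand-in: `kwic()` returns every match of a pattern with a window of surrounding tokens.

```r
library(quanteda)

# All occurrences of "united" with 3 tokens of context on each side
toks <- tokens(data_corpus_inaugural["1789-Washington"])
kwic(toks, pattern = "united", window = 3)
```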
Feature co-occurrence matrix of: 2 by 35,263 features.
features
features fellow-citizens of the senate and house
fellow-citizens 126 82580 129044 514 45185 443
of 0 38848591 121575105 476874 45895049 351852
features
features representatives : i embrace
fellow-citizens 636 415 4760 53
of 436949 915480 6438506 32278
[ reached max_nfeat ... 35,253 more features ]
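A feature co-occurrence matrix like the one above can be built with `fcm()`; a sketch, with the built-in inaugural corpus standing in for the SOTU corpus:

```r
library(quanteda)

# Count how often features co-occur within the same document
toks <- tokens(data_corpus_inaugural[1:2])
fcm(toks, context = "document")
```

With `context = "window"`, co-occurrence can instead be restricted to a span of neighboring tokens.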
How could we use these methods for social science questions? 🤔
Methods: Keywords-in-context, collocations, ngram-shingling (not discussed here)
Use for: Detecting text similarities, text reuse, stereotypical associations
Exemplary studies:
Tutorials: Puschmann & Haim (2019), Schweinberger (2023a), Watanabe & Müller (2023)
Packages: quanteda, textreuse and related publication (Mullen, 2020)
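For collocations specifically, quanteda's companion package quanteda.textstats offers `textstat_collocations()`. A sketch scoring two-word sequences that occur together unusually often:

```r
library(quanteda)
library(quanteda.textstats)

# Score bigram collocations across the built-in inaugural corpus
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
head(textstat_collocations(toks, size = 2, min_count = 10), 5)
```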
We can also rely on information provided by syntax to better identify the meaning of language
Here, we will focus on two approaches:
Note. Figure from Jurafsky & Martin (2023, p. 164).
For explanation of tags, see De Marneffe et al. (2021).
spacyr package (but requires Python, installation somewhat complicated)
udpipe package
library("udpipe")
corpus_sotu %>%
#change format for udpipe package
as_tibble() %>%
mutate(doc_id = paste0("text", 1:n())) %>%
rename(text = value) %>%
#for simplicity, run for fewer documents
slice_head(n = 1) %>%
#part-of-speech tagging, include only related variables
udpipe("english") %>%
select(doc_id, sentence_id, token_id, token, upos) %>%
head(5)
doc_id sentence_id token_id token upos
1 text1 1 1 Fellow ADJ
2 text1 1 2 - PUNCT
3 text1 1 3 Citizens NOUN
4 text1 1 4 of ADP
5 text1 1 5 the DET
Note. Figure from Jurafsky & Martin (2023, p. 381).
For explanation of tags, see De Marneffe et al. (2021).
spacyr package (but requires Python)
udpipe package
library("udpipe")
corpus_sotu %>%
#change format for udpipe package
as_tibble() %>%
mutate(doc_id = paste0("text", 1:n())) %>%
rename(text = value) %>%
#for simplicity, run for fewer documents
slice_head(n = 1) %>%
#dependency parsing, include only related variables
udpipe("english") %>%
select(doc_id, sentence_id, token_id, token, head_token_id, dep_rel) %>%
head(5)
doc_id sentence_id token_id token head_token_id dep_rel
1 text1 1 1 Fellow 3 amod
2 text1 1 2 - 3 punct
3 text1 1 3 Citizens 0 root
4 text1 1 4 of 6 case
5 text1 1 5 the 6 det
With the rsyntax package, we can even plot this to better understand these relations!
How could we use these methods for social science questions? 🤔
Methods: Part-of-speech tagging, dependency parsing
Use for: Detecting entities, entity-specific sentiment, sources, etc.
Exemplary studies:
Packages:
Any questions? 🤔